Introduction to Open Data Science - Course Project

About the project

I have wanted to gain some knowledge of data analysis for quite a long time. I found this course simply by browsing a list of courses with open registration. It is an amazing opportunity to learn the basics of up-to-date data analysis from experts.

The textbook is very well organized and easy to learn from. However, it is not always easy to digest information about tools you haven't used yet.

My github repository

https://github.com/SergeiRaik/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec  6 00:55:32 2022"

Regression and model validation

Introduction

This week I studied regression models. These are extremely important in data analysis as they serve several functions. They help us explain the relationships between variables and form hypotheses about how certain parameters affect target variables. Another function is prediction: using a regression model we can predict unknown values of a target variable from a set of explanatory variables. Here we used a dataset derived from a 2014 questionnaire aimed at identifying how various learning strategies (deep, surface and strategic) affect the learning outcome (exam points).

1. Import the dataset from file.

# import necessary packages
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ purrr   0.3.5
## ✔ tidyr   1.2.1     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#import dataset from file and set proper column types
learning2014 <- read_csv("data/learning2014.csv", col_types = cols(.default = "n", age = "i", gender = "f", points = "i"))

2. Dataset structure.

str(learning2014) # structure of the dataset
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.37 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   .default = col_number(),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   age = col_integer(),
##   ..   attitude = col_number(),
##   ..   deep = col_number(),
##   ..   stra = col_number(),
##   ..   surf = col_number(),
##   ..   points = col_integer()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(learning2014) # dimensions of the dataframe
## [1] 166   7

The dataset consists of 7 variables and 166 observations. The variable gender, a two-level factor, is the respondent's gender. The integers age and points are the respondent's age in years and exam points, respectively. The numerical variables attitude, deep, stra and surf are derived from the respondents' answers and scaled to a 0-5 point scale. More information on the data can be found here. Briefly, they represent three learning strategies (deep, surface and strategic) and general attitude towards statistics.

3. Overview of the data.

First, let's take a look at the summary of each variable.

summary(learning2014) # dataframe summary
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

It can be seen that the variables related to learning strategies and attitude share the same 0-5 scale. The number of female participants is roughly twice that of male participants.

Let's plot the variables against each other and look for any visually observable correlation.

pairs(learning2014[-1]) # plot every variable against each other excluding 1st column (gender)

Due to high variation it is hard to visually detect any correlation from these plots. We can instead use the ggpairs() function from the GGally package, which also provides variable distribution plots and correlation coefficients.

# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(alpha = 0.3, col = gender), lower = list(combo = wrap("facethist", bins = 20)))

From the shapes of the variable distributions we can conclude that all of them except age have a more or less symmetrical bell shape. To check this, let's make a QQ plot for each variable. Simply put, a QQ plot compares one distribution with another: the graph should be linear if two normally distributed samples are compared.

par(mfrow = c(2, 3)) # arrange plots in a 2x3 matrix
# QQ-plot of each variable against a random normal sample
qqplot(learning2014$age, rnorm(50))
qqplot(learning2014$attitude, rnorm(50))
qqplot(learning2014$deep, rnorm(50))
qqplot(learning2014$stra, rnorm(50))
qqplot(learning2014$surf, rnorm(50))
qqplot(learning2014$points, rnorm(50))

Indeed, the plot for age deviates from linearity, so age is not normally distributed here. The other plots are fairly linear.
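As a side note, base R also offers qqnorm() with qqline(), which compares a sample directly against theoretical normal quantiles instead of against a second random draw. A minimal sketch on simulated data (not part of the course dataset):

```r
# simulated samples of the same size as the data: one normal, one right-skewed
set.seed(1)
x_norm <- rnorm(166)
x_skew <- rexp(166)

par(mfrow = c(1, 2))
qqnorm(x_norm, main = "normal sample"); qqline(x_norm)
qqnorm(x_skew, main = "skewed sample"); qqline(x_skew)
```

The normal sample hugs the reference line, while the skewed sample bends away from it at the upper tail.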

4. Regression model

Let's create a linear regression model where exam points is the target variable and attitude, stra and surf are the explanatory variables, chosen based on their correlation coefficients from the previous section.

# create a multivariable regression model
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# create a model summary
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the p-values we can conclude that attitude has the strongest influence on the target variable, while stra and surf are statistically insignificant. Therefore we can simplify the model to a single explanatory variable without losing much explanatory power.

# create a simplified model with single variable
my_model2 <- lm(points ~ attitude, data = learning2014)
# create a model summary
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Indeed, the R-squared decreased only slightly, from about 0.21 to 0.19. However, neither model is very precise, as they explain only ~20 % of the variability in exam points. From the regression model we can conclude that general attitude towards statistics correlates positively with the exam outcome, although the effect is not very large.
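The simplification can also be backed formally with anova(), which runs an F-test between nested linear models. A sketch on simulated data (the variable names mimic the dataset but the values here are made up):

```r
# simulated data where only attitude actually drives points
set.seed(42)
n <- 166
attitude <- runif(n, 1, 5)
stra     <- runif(n, 1, 5)
surf     <- runif(n, 1, 5)
points   <- 11 + 3.5 * attitude + rnorm(n, sd = 5)
d <- data.frame(points, attitude, stra, surf)

full   <- lm(points ~ attitude + stra + surf, data = d)
simple <- lm(points ~ attitude, data = d)

# F-test: does dropping stra and surf significantly worsen the fit?
anova(simple, full)
```

A large p-value in the second row of the table would indicate that the extra variables do not improve the fit enough to keep them.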

5. Diagnostic plots

Let’s have a look at the plot of exam points vs attitude.

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm") # points vs attitude linear plot
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'

Our linear model suggests that there is a linear relationship between exam points and general attitude towards statistics. To validate the model assumptions we can make diagnostic plots.

plot(my_model2, which = 1) # residuals vs fitted 

The residuals vs. fitted plot shows the difference between observed and fitted values: if an observed value equals the predicted value exactly, the residual is zero. From our plot we can conclude that the linearity assumption is reasonable, as the residuals are randomly scattered around the zero line. We can also spot values that do not fit the model: in observations 35, 56 and 145 the students got unexpectedly low exam points despite their high attitude scores.
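The definition of a residual as "observed minus fitted" can be verified directly on a toy model:

```r
# toy model: the residual is exactly observed minus fitted
set.seed(7)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)

all.equal(unname(residuals(fit)), unname(y - fitted(fit)))  # TRUE
```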

plot(my_model2, which = 2) # QQ-plot

To check whether the residuals are normally distributed, we made a QQ plot where the quantiles of the residual distribution are plotted against the theoretical normal quantiles. In our case the graph is fairly linear, although there is some deviation at its extremes.

plot(my_model2, which = 5) # residuals vs leverage plot

The residuals vs. leverage plot shows that none of the points fall outside the Cook's distance contours, so none of them can be considered influential observations.


Logistic regression

Introduction

This week we worked on logistic regression, a method for estimating the relationship between a binary target variable and a set of explanatory variables. It is an extension of linear regression. The dataset we worked with describes student achievement in secondary education at two Portuguese schools. It includes information on family status, school performance, spare time, alcohol consumption, etc. The complete description is published in the UCI Machine Learning Repository.

1. Dataset structure

Let's import the wrangled dataset and take a look at its structure.

# import necessary packages
library(ggplot2)
library(dplyr)
library(GGally)
library(tidyverse)

#import dataset from file and set proper column types
alc <- read_csv("data/alc.csv", show_col_types = FALSE, col_types = cols(.default = "f", age = "i", Medu = "i", Fedu = "i", traveltime = "i", studytime = "i", famrel = "i", freetime = "i", goout = "i", Dalc = "i", Walc = "i", health = "i", failures = "n", absences = "n", G1 = "n", G2 = "n", G3 = "n", alc_use = "n", high_use = "l"))

#check dataset structure, dimensions and column types
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
## $ Fjob       <fct> teacher, other, other, services, other, other, other, teach…
## $ reason     <fct> course, course, other, home, home, reputation, home, home, …
## $ guardian   <fct> mother, father, mother, mother, father, mother, mother, mot…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

The dataset consists of 35 variables and 370 observations.

2. Variables relationship hypotheses

Hypothesis 1 People involved in romantic relationships are less likely to be heavy drinkers.

alc %>% group_by(romantic,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'romantic'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   romantic [2]
##   romantic high_use count
##   <fct>    <lgl>    <int>
## 1 no       FALSE      173
## 2 no       TRUE        78
## 3 yes      FALSE       86
## 4 yes      TRUE        33
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("romantic")+geom_bar()

From both the tabular and the visual summary, the fraction of heavy drinkers is roughly the same within both groups (78/251 ≈ 31 % among singles and 33/119 ≈ 28 % among those in relationships), so this simple summary does not clearly support the hypothesis.

Hypothesis 2 People who go out with friends more often tend to drink more.

# group by frequency of going out and alcohol consumption
alc %>% group_by(goout,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'goout'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       82
##  4     2 TRUE        15
##  5     3 FALSE       97
##  6     3 TRUE        23
##  7     4 FALSE       40
##  8     4 TRUE        38
##  9     5 FALSE       21
## 10     5 TRUE        32
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("goout")+geom_bar()

Indeed, it is clearly seen that those who go out with friends more often tend to drink more.

Hypothesis 3 People who spend more time studying, don’t have time for drinking.

alc %>% group_by(studytime,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'studytime'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 3
## # Groups:   studytime [4]
##   studytime high_use count
##       <int> <lgl>    <int>
## 1         1 FALSE       56
## 2         1 TRUE        42
## 3         2 FALSE      128
## 4         2 TRUE        57
## 5         3 FALSE       52
## 6         3 TRUE         8
## 7         4 FALSE       23
## 8         4 TRUE         4

A correlation can be observed here as well. Within the group of students who spend less than 2 hours a week studying, almost half are "heavy drinkers", whereas among those who study more than 10 hours the share decreases to ~15 %.

g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("studytime")+geom_bar()

Hypothesis 4 People who consume large amounts of alcohol, have lower grades.

To analyze grades, we first take the average of the three grades and then compare its distribution between heavy and moderate alcohol consumers.

alc$grades <- rowMeans(alc[, c("G1","G2","G3")])
g <- ggplot(data = alc, aes(x=high_use, y = grades))
g + geom_boxplot()

From the boxplot it is not obvious whether alcohol consumption has an effect on grades. It seems that for moderate drinkers the distribution is broader and the median is higher, whereas for heavy drinkers the distribution of grades is tilted towards lower values.

alc %>% group_by(high_use) %>% summarise(mean(grades))
## # A tibble: 2 × 2
##   high_use `mean(grades)`
##   <lgl>             <dbl>
## 1 FALSE              11.8
## 2 TRUE               10.9

The mean grades do not differ much. Let's take a look at each distribution separately.

g <- ggplot(data = alc, aes(x=grades))
g + geom_density(alpha=0.2)+facet_grid(~high_use)

Here it seems that the grade distribution for moderate drinkers is bimodal, and the mode with higher grades is absent for heavy drinkers. However, we will see in the regression model whether these parameters are significant or not.

3. Logistic regression model

To begin with, we create a model with high_use as the target variable and those mentioned in the hypotheses section (grades, studytime, goout, romantic) as explanatory variables.

# create a logistic model and its summary
m <- glm(high_use ~ grades + studytime + goout + romantic, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ grades + studytime + goout + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7852  -0.7805  -0.5395   0.9537   2.5739  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.43474    0.73913  -1.941  0.05224 .  
## grades      -0.05210    0.04664  -1.117  0.26400    
## studytime   -0.58343    0.16813  -3.470  0.00052 ***
## goout        0.72552    0.11848   6.123 9.16e-10 ***
## romanticyes -0.19076    0.27253  -0.700  0.48396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 385.84  on 365  degrees of freedom
## AIC: 395.84
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2381778 0.05452509 0.9971120
## grades      0.9492380 0.86558442 1.0397804
## studytime   0.5579787 0.39679174 0.7685481
## goout       2.0658027 1.64754827 2.6242897
## romanticyes 0.8263314 0.48042098 1.4021977

An odds ratio close to 1 means that changing the explanatory variable does not change the odds of the target variable. Grades and romantic relationships, with ORs of 0.95 and 0.83 respectively, show no clear association with alcohol consumption (their confidence intervals include 1). Meanwhile, each additional point of goout roughly doubles the odds of high alcohol consumption, while each point of studytime roughly halves them.
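The arithmetic behind this interpretation: a logistic regression coefficient β is the change in log-odds per unit of the explanatory variable, so exp(β) is the multiplicative change in odds. A self-contained sketch:

```r
# a coefficient of log(2) means each unit doubles the odds
beta <- log(2)
exp(beta)                # odds ratio = 2

# converting between probability and odds
p <- 0.75
odds <- p / (1 - p)      # 3
odds / (1 + odds)        # back to 0.75
```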

4. Predictive power of the model

Let's use our model to predict the level of alcohol consumption by calculating the probability of high_use with the predict() function.

#calculate the probabilities of high_use using logistic model
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)

#prediction is true if probability > 0.5
alc <- mutate(alc, prediction = probability>0.5) 
alc %>% group_by(high_use,prediction) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   high_use [2]
##   high_use prediction count
##   <lgl>    <lgl>      <int>
## 1 FALSE    FALSE        238
## 2 FALSE    TRUE          21
## 3 TRUE     FALSE         70
## 4 TRUE     TRUE          41

From the summary table we can conclude that when the model predicts a low level of alcohol consumption (FALSE), it is correct in 238 / (238 + 70) ≈ 77 % of cases; when it predicts a high level, in 41 / (41 + 21) ≈ 66 %. Overall the model is wrong in (21 + 70) / (238 + 21 + 70 + 41) ≈ 25 % of cases. Compared with a simple guessing strategy (predicting each class with probability 0.5, giving ~50 % error), the model roughly halves the number of wrong predictions, which is not bad considering the small number of variables used.
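The error rate above can be expressed as the mean of a 0/1 loss; a sketch with a hypothetical helper and made-up vectors:

```r
# hypothetical helper: proportion of misclassified observations
loss_func <- function(class, prob, threshold = 0.5) {
  mean(class != (prob > threshold))
}

actual <- c(TRUE, FALSE, FALSE, TRUE)
probs  <- c(0.9, 0.2, 0.6, 0.4)   # model-estimated probabilities
loss_func(actual, probs)          # 2 of 4 misclassified -> 0.5
```

Applied to the real data, `loss_func(alc$high_use, alc$probability)` would reproduce the ~25 % figure.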


Clustering and classification

Introduction

The dataset we are exploring today contains information about suburban areas near Boston. It includes information on the towns' environmental conditions, population, housing and transportation characteristics, crime rate, etc. The full description can be found here.

1. Loading the dataset

Let’s load necessary libraries and the Boston dataset

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
# load the data
data("Boston")
#explore structure and dimensions 
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

The dataset consists of 14 variables and 506 observations.

2. Visualization of variables and correlations

When analyzing a dataset it is always useful to look at the variable distributions. To do that, we first convert the data from wide to long format (variable-value pairs) using the melt() function from the reshape package. Then we visualize the distributions with the ggplot2 package.

library(ggplot2)
library(reshape)
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## The following object is masked from 'package:dplyr':
## 
##     rename
melt.boston <- melt(Boston)
## Using  as id variables
ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")

library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) 

# print the correlation matrix
cor_matrix %>% 
  round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

3. Scaling the dataset

To scale the dataset we subtract each variable's mean and divide by its standard deviation (SD).

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

As a result of scaling, each variable now has a mean of zero, and an observation that deviates from the mean by one SD takes the value ±1.
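This equivalence between scale() and the manual formula can be checked directly on a small vector:

```r
# scale() is (x - mean(x)) / sd(x), applied column by column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)

all.equal(as.vector(scale(x)), manual)  # TRUE
c(mean(manual), sd(manual))             # 0 and 1
```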

4. Turning crime rate into categorical variable

To build a model for predicting the crime rate we first have to turn it into a categorical variable. To do that we create a quantile vector with the quantile() function.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Then we break the crim variable into a four-level factor and substitute it for the original variable in the scaled dataset.

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

5. LDA model

Then we have to split the dataset into train and test samples.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2351485 0.2574257 0.2698020 
## 
## Group means:
##                  zn      indus        chas        nox          rm        age
## low       1.0123430 -0.9073585 -0.10828322 -0.8660377  0.44660552 -0.9047104
## med_low  -0.0841951 -0.2592071 -0.06511327 -0.5467772 -0.13296553 -0.2786747
## med_high -0.3701526  0.1799148  0.10623826  0.3931296  0.06484032  0.3709025
## high     -0.4872402  1.0169738 -0.01948777  1.0811178 -0.43031968  0.8271672
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8988160 -0.6971471 -0.7384480 -0.47360326  0.37976910 -0.78102311
## med_low   0.3361144 -0.5514983 -0.4638385 -0.02467884  0.29874707 -0.13592408
## med_high -0.3624551 -0.4208892 -0.3197575 -0.28014337  0.07537608  0.05185534
## high     -0.8612369  1.6395837  1.5150965  0.78247128 -0.73425852  0.89245605
##                 medv
## low       0.51222033
## med_low  -0.01329549
## med_high  0.16841437
## high     -0.71240603
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.12537068  0.890252276 -0.89750550
## indus   -0.02734880 -0.182028889  0.28586923
## chas    -0.00108300  0.004845711  0.09436041
## nox      0.42588508 -0.606616368 -1.49180245
## rm       0.03052905  0.009392296 -0.19302088
## age      0.23346708 -0.309393987  0.09607816
## dis     -0.11762934 -0.353654987  0.26980335
## rad      3.10926409  0.918249406 -0.01032002
## tax      0.04543014 -0.101494877  0.58776961
## ptratio  0.14304239  0.041053817 -0.28769819
## black   -0.09752141  0.037421754  0.12175008
## lstat    0.15775237 -0.348573278  0.23150279
## medv     0.05554897 -0.531478563 -0.23204900
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9505 0.0377 0.0118
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the LDA plot we can clearly see that observations with a high crime rate are very well separated. The other classes overlap, which will clearly limit the predictive power of the model.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15      15        1    0
##   med_low    5      24        2    0
##   med_high   0       5       16    1
##   high       0       0        0   18

Indeed, as you can see, the best results are obtained for the high crime-rate class, whereas the other classes produce many false predictions.
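The overall accuracy can be read off the same cross-tabulation; a small sketch, reusing the `correct_classes` and `lda.pred` objects defined above:

```r
# store the cross-tabulation of true vs. predicted classes
conf <- table(correct = correct_classes, predicted = lda.pred$class)
# overall accuracy: correctly classified observations / all observations
sum(diag(conf)) / sum(conf)
# per-class recall: share of each true class that was predicted correctly
diag(conf) / rowSums(conf)
```

For the table above the accuracy is (15 + 24 + 16 + 18) / 102 ≈ 0.72, and the per-class recalls confirm that the high class is predicted perfectly while the low class is the weakest.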

data('Boston')
bst <- scale(Boston) 
bst <- as.data.frame(bst) 
dist_eu <- dist(bst) 
km <- kmeans(x = bst, centers = 3)

Then we determine the optimal number of clusters by calculating the total within-cluster sum of squares (TWCSS).

set.seed(21) 
#set the max clusters numbers
k_max <- 10 

#calculate the total sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bst, k)$tot.withinss}) 
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line') 

The TWCSS drops dramatically when moving from one cluster to two, which suggests that two is the optimal number of clusters.
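The elbow can also be inspected numerically from the `twcss` vector computed above (a quick sketch, not part of the original assignment):

```r
# absolute reduction in TWCSS when adding one more cluster
drops <- -diff(twcss)
# relative reduction, as a fraction of the previous TWCSS
round(drops / head(twcss, -1), 2)
```

A large relative drop from k = 1 to k = 2, followed by much smaller ones, supports choosing two clusters.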

Let's run k-means again with two centers and plot the result:

km <-kmeans(bst, centers = 2)
pairs(bst, col = km$cluster)

Super bonus 3D plots

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

First, we color the points by the crime classes of the training set.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

Second, we color the points by the k-means clusters.

km1 <- dplyr::select(train, -crime)
km1 <- kmeans(km1, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km1$cluster)

The clusterings are broadly similar; however, observations with a higher crime rate are better separated when the points are colored by crime class.
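This impression can be checked with a cross-tabulation of the k-means clusters against the crime classes (a sketch using the `km1` and `train` objects from above; note that k-means cluster labels are arbitrary):

```r
# how do the four k-means clusters map onto the crime classes?
table(cluster = km1$cluster, crime = train$crime)
```

Ideally each cluster would be dominated by a single crime class; mixed rows indicate the overlap seen in the plots.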


Dimensionality reduction techniques

Introduction

This week we are working with a dataset derived from the United Nations Development Programme data. It describes parameters related to quality of life and gender equality in different countries.

It includes information on the distribution of men and women in fields such as education, politics and labor, as well as data on the adolescent birth rate, maternal mortality, GNI per capita and life expectancy. A full description can be found here

library(tidyverse)
library(GGally)
library(corrplot)
human <- read.csv("./data/human.csv", row.names = 1)

1. Overview of the data

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human)

cor(human) %>% corrplot(method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Regarding the distributions of the variables, some are approximately normal with a more or less symmetrical bell shape (female share in parliament, the labor and secondary-education ratios, expected years of schooling). Fortunately, the data are skewed towards higher life expectancy and lower maternal mortality.

Some strong correlations are observed between the variables.
In countries with a higher GNI per capita, life expectancy and expected years of schooling are also higher. Conversely, poorer countries show a high adolescent birth rate and high maternal mortality, which can possibly be explained by the lower level of public medical services in countries with low GNI. Unsurprisingly, in countries with a high adolescent birth rate the share of women in secondary education is lower, possibly because these countries are generally characterized by a highly traditionalist, patriarchal society.

2. Principal component analysis

PCA is a statistical method used to reduce the dimensionality of large datasets. Principal components are calculated in such a way that they capture the highest possible variance in the dataset.

However, the method is sensitive to the scale of the variables, so it is important to standardize the data. To illustrate this, we perform PCA on both the raw and the scaled data.
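The sensitivity to scale is easy to demonstrate on a toy example (artificial data, not part of the course dataset): two independent variables, one measured on a scale a thousand times larger than the other.

```r
set.seed(1)
# two independent standard normal variables on very different scales
toy <- data.frame(a = rnorm(100),          # scale ~ 1
                  b = rnorm(100) * 1000)   # scale ~ 1000
# raw PCA: the large-scale variable dominates PC1 almost completely
summary(prcomp(toy))
# standardized PCA: both variables contribute roughly equally
summary(prcomp(toy, scale. = TRUE))
```

With standardization each component explains about half of the variance, as expected for two independent variables.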

2.1 Non-standardized data

# perform principal component analysis (with the SVD method) on raw dataset
pca_human <- prcomp(human)

summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

As you can see from the summary, almost 100 % of the variance is captured by PC1, and the GNI arrow is parallel to the PC1 axis. The reason is that GNI is measured in thousands and therefore dominates the calculation.

biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


2.2 PCA of scaled data

To remove the influence of differing scales, we standardize the dataset as in the previous week's assignment: subtract each variable's mean and divide by its standard deviation. This makes each variable's mean zero and its standard deviation one.
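This is exactly what scale() does column by column; the equivalence can be verified by hand on one variable (a small check using the human data loaded above):

```r
# standardize GNI manually: subtract the mean, divide by the standard deviation
gni_manual <- (human$GNI - mean(human$GNI)) / sd(human$GNI)
# compare with the corresponding column of scale()'s output
all.equal(as.numeric(scale(human)[, "GNI"]), gni_manual)  # TRUE
```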

# scale the dataset
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

As you can see, PC1 now explains around half of the variance and PC2 about 16 %.

The information we get from the plot is quite similar to what we obtained with the simple cross-correlations. Orthogonal arrows indicate a lack of correlation; the closer the angle between two arrows is to 0, the stronger the positive correlation, while an angle of around 180 degrees indicates a negative correlation.
We can clearly see that GNI per capita correlates strongly with life expectancy and with the quality and availability of education. Those variables correlate negatively with a high adolescent birth rate and maternal mortality.
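The angle-correlation link can be checked directly from the loadings of the scaled PCA (a sketch; the approximation is only as good as the share of variance captured by PC1 and PC2):

```r
# loadings of the first two principal components
ld <- pca_human$rotation[, 1:2]
# cosine of the angle between two biplot arrows
cos_angle <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
# a small angle (cosine near 1) should match a strong positive correlation
cos_angle(ld["GNI", ], ld["Life.Exp", ])
cor(human$GNI, human$Life.Exp)
```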

Interestingly, the variables mentioned above are almost independent of the distribution of men and women in the labor market and in politics.

3. Multiple correspondence analysis

To perform the MCA we use a dataset describing people's tea-drinking habits.

The variables are mostly self-explanatory. They describe how the respondents drink tea (18 variables), how they perceive the product (12 variables) and some personal details. Overall there are 35 factors and one integer variable (age), with 300 observations in the dataset.

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
view(tea)

summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
library(FactoMineR)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "breakfast", "dinner")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.230   0.222   0.194   0.183   0.173   0.159   0.140
## % of var.             13.817  13.334  11.656  10.951  10.381   9.542   8.379
## Cumulative % of var.  13.817  27.151  38.807  49.758  60.139  69.681  78.060
##                        Dim.8   Dim.9  Dim.10
## Variance               0.133   0.126   0.106
## % of var.              7.980   7.584   6.375
## Cumulative % of var.  86.040  93.625 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  |  0.329  0.157  0.099 | -0.187  0.053  0.032 | -0.022
## 2                  |  0.948  1.300  0.557 | -0.330  0.163  0.067 |  0.001
## 3                  | -0.783  0.888  0.216 |  0.569  0.485  0.114 |  0.704
## 4                  | -1.129  1.845  0.446 |  0.238  0.085  0.020 |  0.742
## 5                  |  0.138  0.027  0.029 | -0.146  0.032  0.033 | -0.146
## 6                  | -0.783  0.888  0.216 |  0.569  0.485  0.114 |  0.704
## 7                  |  0.138  0.027  0.029 | -0.146  0.032  0.033 | -0.146
## 8                  |  0.576  0.480  0.209 | -0.001  0.000  0.000 |  0.003
## 9                  |  0.558  0.450  0.217 | -0.467  0.327  0.152 |  0.088
## 10                 |  0.822  0.978  0.515 |  0.296  0.131  0.067 |  0.113
##                       ctr   cos2  
## 1                   0.001  0.000 |
## 2                   0.000  0.000 |
## 3                   0.851  0.175 |
## 4                   0.945  0.193 |
## 5                   0.036  0.032 |
## 6                   0.851  0.175 |
## 7                   0.036  0.032 |
## 8                   0.000  0.000 |
## 9                   0.013  0.005 |
## 10                  0.022  0.010 |
## 
## Categories (the 10 first)
##                       Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## black              |  1.176 24.682  0.453 11.634 |  0.443  3.623  0.064  4.379
## Earl Grey          | -0.371  6.405  0.248 -8.613 | -0.377  6.839  0.256 -8.744
## green              | -0.468  1.740  0.027 -2.842 |  1.210 12.071  0.181  7.353
## alone              | -0.182  1.551  0.061 -4.279 |  0.372  6.743  0.257  8.764
## lemon              | -0.457  1.666  0.026 -2.781 | -0.513  2.168  0.032 -3.117
## milk               |  0.604  5.543  0.097  5.384 | -0.965 14.669  0.248 -8.604
## other              |  1.384  4.157  0.059  4.208 |  0.577  0.750  0.010  1.756
## tea bag            | -0.199  1.629  0.052 -3.941 | -0.301  3.851  0.118 -5.952
## tea bag+unpackaged |  0.225  1.143  0.023  2.623 |  0.130  0.399  0.008  1.521
## unpackaged         |  0.355  1.094  0.017  2.267 |  1.081 10.524  0.159  6.905
##                       Dim.3    ctr   cos2 v.test  
## black              |  0.212  0.953  0.015  2.100 |
## Earl Grey          | -0.015  0.012  0.000 -0.347 |
## green              | -0.388  1.424  0.019 -2.361 |
## alone              | -0.174  1.691  0.056 -4.104 |
## lemon              | -0.078  0.058  0.001 -0.476 |
## milk               | -0.015  0.004  0.000 -0.130 |
## other              |  4.162 44.580  0.536 12.656 |
## tea bag            |  0.013  0.008  0.000  0.251 |
## tea bag+unpackaged |  0.470  5.938  0.101  5.490 |
## unpackaged         | -1.287 17.059  0.226 -8.220 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.454 0.300 0.028 |
## How                | 0.178 0.324 0.540 |
## how                | 0.053 0.197 0.268 |
## sugar              | 0.248 0.218 0.002 |
## breakfast          | 0.286 0.216 0.000 |
## dinner             | 0.163 0.078 0.327 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The distance between points here indicates the similarity between variable categories: categories that appear close together tend to co-occur. We can conclude that green tea is very unlikely to come with milk, whereas Earl Grey commonly goes with sugar and lemon. (more chapters to be added similarly as we proceed with the course!)